This case study presents a data analysis project completed as the capstone of the Google Data Analytics Professional Certificate, focusing on the usage of Bellabeat smart devices. Bellabeat is a high-tech manufacturer of health-focused smart products designed specifically for women.
The company was founded in 2013 by Urška Sršen and Sando Mur and has expanded quickly since then, with the potential to become a larger player in the global smart device market. Our team has been asked to analyze smart device data to gain insight into how consumers use their smart devices. These insights will then help guide the company’s marketing strategy.
Currently, the company has five key product and service offerings.
BUSINESS TASK: Identify trends in how consumers use non-Bellabeat smart devices to gain insight and help guide marketing strategy for Bellabeat to grow as a global player.
Key Stakeholders:
· Urška Sršen — Bellabeat’s cofounder and Chief Creative Officer
· Sando Mur — Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
· Bellabeat marketing analytics team — A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy.
Data Source: FitBit Fitness Tracker Data
Data Organisation: This data set contains personal fitness tracker data from thirty Fitbit users, provided as 18 CSV files. The data is organized in both long and wide formats. The files contain detailed information about daily activity, sleep, weight, calories, and intensities.
The FitBit Fitness Tracker Data was collected in 2016, which makes it outdated for current trend analysis. Additionally, while the data source states a time range of 3/12/2016 to 5/12/2016, I am considering only the files with the most recent available data (4/12/2016 to 5/12/2016).
Licensing
This dataset is under the CC0: Public Domain license, meaning the creator has waived their rights to the work under copyright law. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring.
The dataset has limitations:
Data is available for only 30 users, which just meets the common n ≥ 30 rule of thumb associated with the central limit theorem. Further limitations stem from this small sample size and from the absence of key participant characteristics such as gender, age, location, and lifestyle.
The data is assessed using the ROCCC approach:
• Reliability: The data comes from 30 FitBit users who consented to the submission of personal tracker data, and it was generated through a survey distributed via Amazon Mechanical Turk.
• Originality: The data is third-party data, collected from 30 FitBit users via Amazon Mechanical Turk.
• Comprehensive: The data includes minute-level output for physical activity, heart rate, and sleep monitoring, along with calories used and daily steps taken. The dataset is limited, and most data is recorded on certain days of the week.
• Current: Information was gathered between March and May of 2016. Since the data is outdated in 2024, consumers’ current FitBit usage may have changed.
• Cited: CC0: Public Domain; dataset made available through Mobius.
I chose R for this project because, while I have some experience with tools like SQL and Excel, R is a new and exciting tool for me. My only exposure to R so far has been during this certificate course. R lets me perform data cleaning, processing, and visualization efficiently within a single platform, which makes it an excellent choice for managing and presenting data insights.
setwd("/cloud/project/BellaBeat/")
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("skimr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("tidyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("plotly")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("readr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library("here")
## here() starts at /cloud/project
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library ("skimr")
library("janitor")
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library("dplyr")
library("ggplot2")
library("tidyr")
library("lubridate")
library("plotly")
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
base_path <- "/cloud/project/BellaBeat/"
daily_activity <- read_csv(file.path(base_path, "dailyActivity_merged.csv"))
## Rows: 940 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityDate
## dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_calories <- read_csv(file.path(base_path, "hourlyCalories_merged.csv"))
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_intensities <- read_csv(file.path(base_path, "hourlyIntensities_merged.csv"))
## Rows: 22099 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (3): Id, TotalIntensity, AverageIntensity
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hourly_steps <- read_csv(file.path(base_path, "hourlySteps_merged.csv"))
## Rows: 22099 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ActivityHour
## dbl (2): Id, StepTotal
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
daily_sleep <- read_csv(file.path(base_path, "sleepDay_merged.csv"))
## Rows: 410 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): SleepDay
## dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
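The `Id` column is read as a double, which is why it prints as `1.5e+09` or `1.50e9` in some of the summaries. As a hedged aside, the sketch below shows one way identifiers could be kept as text on import. It uses base R's `read.csv()` on a small inline sample rather than the project files; the name `sample_csv` and the two-row extract are illustrative assumptions, not part of the original analysis.

```r
# Inline sample standing in for hourlyCalories_merged.csv (illustrative only)
sample_csv <- "Id,ActivityHour,Calories
1503960366,4/12/2016 12:00:00 AM,81
1503960366,4/12/2016 1:00:00 AM,61"

# Read Id as character so it is treated as a label rather than a number;
# columns not named in colClasses keep their automatically detected type
hourly_sample <- read.csv(
  text = sample_csv,
  colClasses = c(Id = "character"),
  stringsAsFactors = FALSE
)

str(hourly_sample)
```

With readr, the comparable option would be the `col_types` argument of `read_csv()`.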
head(daily_activity)
## # A tibble: 6 × 15
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 12-04-2016 13162 8.5 8.5
## 2 1503960366 13-04-2016 10735 6.97 6.97
## 3 1503960366 14-04-2016 10460 6.74 6.74
## 4 1503960366 15-04-2016 9762 6.28 6.28
## 5 1503960366 16-04-2016 12669 8.16 8.16
## 6 1503960366 17-04-2016 9705 6.48 6.48
## # ℹ 10 more variables: LoggedActivitiesDistance <dbl>,
## # VeryActiveDistance <dbl>, ModeratelyActiveDistance <dbl>,
## # LightActiveDistance <dbl>, SedentaryActiveDistance <dbl>,
## # VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>
head(hourly_calories)
## # A tibble: 6 × 3
## Id ActivityHour Calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 81
## 2 1503960366 4/12/2016 1:00:00 AM 61
## 3 1503960366 4/12/2016 2:00:00 AM 59
## 4 1503960366 4/12/2016 3:00:00 AM 47
## 5 1503960366 4/12/2016 4:00:00 AM 48
## 6 1503960366 4/12/2016 5:00:00 AM 48
head(hourly_intensities)
## # A tibble: 6 × 4
## Id ActivityHour TotalIntensity AverageIntensity
## <dbl> <chr> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 20 0.333
## 2 1503960366 4/12/2016 1:00:00 AM 8 0.133
## 3 1503960366 4/12/2016 2:00:00 AM 7 0.117
## 4 1503960366 4/12/2016 3:00:00 AM 0 0
## 5 1503960366 4/12/2016 4:00:00 AM 0 0
## 6 1503960366 4/12/2016 5:00:00 AM 0 0
head(hourly_steps)
## # A tibble: 6 × 3
## Id ActivityHour StepTotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
head(daily_sleep)
## # A tibble: 6 × 5
## Id SleepDay TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 04-12-2016 00:… 1 327 346
## 2 1503960366 4/13/2016 12:0… 2 384 407
## 3 1503960366 4/15/2016 12:0… 1 412 442
## 4 1503960366 4/16/2016 12:0… 2 340 367
## 5 1503960366 4/17/2016 12:0… 1 700 712
## 6 1503960366 4/19/2016 12:0… 1 304 320
str(daily_activity)
## spc_tbl_ [940 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:940] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr [1:940] "12-04-2016" "13-04-2016" "14-04-2016" "15-04-2016" ...
## $ TotalSteps : num [1:940] 13162 10735 10460 9762 12669 ...
## $ TotalDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 6.48 8.59 9.88 6.68 6.34 ...
## $ TrackerDistance : num [1:940] 8.5 6.97 6.74 6.28 8.16 6.48 8.59 9.88 6.68 6.34 ...
## $ LoggedActivitiesDistance: num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num [1:940] 1.88 1.57 2.44 2.14 2.71 3.19 3.25 3.53 1.96 1.34 ...
## $ ModeratelyActiveDistance: num [1:940] 0.55 0.69 0.4 1.26 0.41 0.78 0.64 1.32 0.48 0.35 ...
## $ LightActiveDistance : num [1:940] 6.06 4.71 3.91 2.83 5.04 2.51 4.71 5.03 4.24 4.65 ...
## $ SedentaryActiveDistance : num [1:940] 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : num [1:940] 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : num [1:940] 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : num [1:940] 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : num [1:940] 728 776 1218 726 773 ...
## $ Calories : num [1:940] 1985 1797 1776 1745 1863 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityDate = col_character(),
## .. TotalSteps = col_double(),
## .. TotalDistance = col_double(),
## .. TrackerDistance = col_double(),
## .. LoggedActivitiesDistance = col_double(),
## .. VeryActiveDistance = col_double(),
## .. ModeratelyActiveDistance = col_double(),
## .. LightActiveDistance = col_double(),
## .. SedentaryActiveDistance = col_double(),
## .. VeryActiveMinutes = col_double(),
## .. FairlyActiveMinutes = col_double(),
## .. LightlyActiveMinutes = col_double(),
## .. SedentaryMinutes = col_double(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(hourly_calories)
## spc_tbl_ [22,099 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ Calories : num [1:22099] 81 61 59 47 48 48 48 47 68 141 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityHour = col_character(),
## .. Calories = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(hourly_intensities)
## spc_tbl_ [22,099 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour : chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ TotalIntensity : num [1:22099] 20 8 7 0 0 0 0 0 13 30 ...
## $ AverageIntensity: num [1:22099] 0.333 0.133 0.117 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityHour = col_character(),
## .. TotalIntensity = col_double(),
## .. AverageIntensity = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(hourly_steps)
## spc_tbl_ [22,099 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:22099] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour: chr [1:22099] "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ StepTotal : num [1:22099] 373 160 151 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. ActivityHour = col_character(),
## .. StepTotal = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str(daily_sleep)
## spc_tbl_ [410 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Id : num [1:410] 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr [1:410] "04-12-2016 00:00" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : num [1:410] 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: num [1:410] 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : num [1:410] 346 407 442 367 712 320 377 364 384 449 ...
## - attr(*, "spec")=
## .. cols(
## .. Id = col_double(),
## .. SleepDay = col_character(),
## .. TotalSleepRecords = col_double(),
## .. TotalMinutesAsleep = col_double(),
## .. TotalTimeInBed = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Before moving on to the cleaning process, I check the number of unique users in each data frame.
## [1] 33
## [1] 33
## [1] 33
## [1] 33
## [1] 24
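The five counts above are printed without the code that produced them. A minimal sketch of how such a count could be obtained is shown below on a toy data frame; `toy_activity` and its values are hypothetical stand-ins for the imported tables, and base R is used so the example runs on its own.

```r
# Toy stand-in for one of the imported tables (values are invented)
toy_activity <- data.frame(
  Id = c(101, 101, 102, 103),
  TotalSteps = c(13162, 10735, 8163, 10694)
)

# Count distinct users by de-duplicating the Id column
n_users <- length(unique(toy_activity$Id))
n_users  # 3 distinct ids in the toy data
```

With dplyr loaded, `n_distinct(toy_activity$Id)` gives the same count.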
sum(duplicated(daily_activity))
## [1] 0
sum(duplicated(daily_sleep))
## [1] 0
sum(duplicated(hourly_calories))
## [1] 0
sum(duplicated(hourly_intensities))
## [1] 0
sum(duplicated(hourly_steps))
## [1] 0
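I will ensure that column names follow a valid syntax and the same format in all datasets, since the datasets will be merged later on. Note that clean_names() is called below only to preview snake_case names, because its result is not assigned; the names actually applied are the all-lower-case ones produced by rename_with(..., tolower).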
clean_names(daily_activity)
## # A tibble: 940 × 15
## id activity_date total_steps total_distance tracker_distance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 12-04-2016 13162 8.5 8.5
## 2 1503960366 13-04-2016 10735 6.97 6.97
## 3 1503960366 14-04-2016 10460 6.74 6.74
## 4 1503960366 15-04-2016 9762 6.28 6.28
## 5 1503960366 16-04-2016 12669 8.16 8.16
## 6 1503960366 17-04-2016 9705 6.48 6.48
## 7 1503960366 18-04-2016 13019 8.59 8.59
## 8 1503960366 19-04-2016 15506 9.88 9.88
## 9 1503960366 20-04-2016 10544 6.68 6.68
## 10 1503960366 21-04-2016 9819 6.34 6.34
## # ℹ 930 more rows
## # ℹ 10 more variables: logged_activities_distance <dbl>,
## # very_active_distance <dbl>, moderately_active_distance <dbl>,
## # light_active_distance <dbl>, sedentary_active_distance <dbl>,
## # very_active_minutes <dbl>, fairly_active_minutes <dbl>,
## # lightly_active_minutes <dbl>, sedentary_minutes <dbl>, calories <dbl>
daily_activity <- rename_with(daily_activity, tolower)
clean_names(hourly_calories)
## # A tibble: 22,099 × 3
## id activity_hour calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 81
## 2 1503960366 4/12/2016 1:00:00 AM 61
## 3 1503960366 4/12/2016 2:00:00 AM 59
## 4 1503960366 4/12/2016 3:00:00 AM 47
## 5 1503960366 4/12/2016 4:00:00 AM 48
## 6 1503960366 4/12/2016 5:00:00 AM 48
## 7 1503960366 4/12/2016 6:00:00 AM 48
## 8 1503960366 4/12/2016 7:00:00 AM 47
## 9 1503960366 4/12/2016 8:00:00 AM 68
## 10 1503960366 4/12/2016 9:00:00 AM 141
## # ℹ 22,089 more rows
hourly_calories <- rename_with(hourly_calories, tolower)
clean_names(hourly_intensities)
## # A tibble: 22,099 × 4
## id activity_hour total_intensity average_intensity
## <dbl> <chr> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 20 0.333
## 2 1503960366 4/12/2016 1:00:00 AM 8 0.133
## 3 1503960366 4/12/2016 2:00:00 AM 7 0.117
## 4 1503960366 4/12/2016 3:00:00 AM 0 0
## 5 1503960366 4/12/2016 4:00:00 AM 0 0
## 6 1503960366 4/12/2016 5:00:00 AM 0 0
## 7 1503960366 4/12/2016 6:00:00 AM 0 0
## 8 1503960366 4/12/2016 7:00:00 AM 0 0
## 9 1503960366 4/12/2016 8:00:00 AM 13 0.217
## 10 1503960366 4/12/2016 9:00:00 AM 30 0.5
## # ℹ 22,089 more rows
hourly_intensities <- rename_with(hourly_intensities, tolower)
clean_names(hourly_steps)
## # A tibble: 22,099 × 3
## id activity_hour step_total
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
## 7 1503960366 4/12/2016 6:00:00 AM 0
## 8 1503960366 4/12/2016 7:00:00 AM 0
## 9 1503960366 4/12/2016 8:00:00 AM 250
## 10 1503960366 4/12/2016 9:00:00 AM 1864
## # ℹ 22,089 more rows
hourly_steps <- rename_with(hourly_steps, tolower)
clean_names(daily_sleep)
## # A tibble: 410 × 5
## id sleep_day total_sleep_records total_minutes_asleep total_time_in_bed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1.50e9 04-12-20… 1 327 346
## 2 1.50e9 4/13/201… 2 384 407
## 3 1.50e9 4/15/201… 1 412 442
## 4 1.50e9 4/16/201… 2 340 367
## 5 1.50e9 4/17/201… 1 700 712
## 6 1.50e9 4/19/201… 1 304 320
## 7 1.50e9 4/20/201… 1 360 377
## 8 1.50e9 4/21/201… 1 325 364
## 9 1.50e9 4/23/201… 1 361 384
## 10 1.50e9 4/24/201… 1 430 449
## # ℹ 400 more rows
daily_sleep <- rename_with(daily_sleep, tolower)
head(daily_activity)
## # A tibble: 6 × 15
## id activitydate totalsteps totaldistance trackerdistance
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 12-04-2016 13162 8.5 8.5
## 2 1503960366 13-04-2016 10735 6.97 6.97
## 3 1503960366 14-04-2016 10460 6.74 6.74
## 4 1503960366 15-04-2016 9762 6.28 6.28
## 5 1503960366 16-04-2016 12669 8.16 8.16
## 6 1503960366 17-04-2016 9705 6.48 6.48
## # ℹ 10 more variables: loggedactivitiesdistance <dbl>,
## # veryactivedistance <dbl>, moderatelyactivedistance <dbl>,
## # lightactivedistance <dbl>, sedentaryactivedistance <dbl>,
## # veryactiveminutes <dbl>, fairlyactiveminutes <dbl>,
## # lightlyactiveminutes <dbl>, sedentaryminutes <dbl>, calories <dbl>
head(hourly_calories)
## # A tibble: 6 × 3
## id activityhour calories
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 81
## 2 1503960366 4/12/2016 1:00:00 AM 61
## 3 1503960366 4/12/2016 2:00:00 AM 59
## 4 1503960366 4/12/2016 3:00:00 AM 47
## 5 1503960366 4/12/2016 4:00:00 AM 48
## 6 1503960366 4/12/2016 5:00:00 AM 48
head(hourly_intensities)
## # A tibble: 6 × 4
## id activityhour totalintensity averageintensity
## <dbl> <chr> <dbl> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 20 0.333
## 2 1503960366 4/12/2016 1:00:00 AM 8 0.133
## 3 1503960366 4/12/2016 2:00:00 AM 7 0.117
## 4 1503960366 4/12/2016 3:00:00 AM 0 0
## 5 1503960366 4/12/2016 4:00:00 AM 0 0
## 6 1503960366 4/12/2016 5:00:00 AM 0 0
head(hourly_steps)
## # A tibble: 6 × 3
## id activityhour steptotal
## <dbl> <chr> <dbl>
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
head(daily_sleep)
## # A tibble: 6 × 5
## id sleepday totalsleeprecords totalminutesasleep totaltimeinbed
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1503960366 04-12-2016 00:… 1 327 346
## 2 1503960366 4/13/2016 12:0… 2 384 407
## 3 1503960366 4/15/2016 12:0… 1 412 442
## 4 1503960366 4/16/2016 12:0… 2 340 367
## 5 1503960366 4/17/2016 12:0… 1 700 712
## 6 1503960366 4/19/2016 12:0… 1 304 320
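Because the `clean_names()` results are printed but not assigned, the data frames end up with names such as `totalsteps` rather than `total_steps`. The sketch below illustrates the difference between the two naming styles using base R only. `to_snake_case` is a hypothetical helper written for this illustration; it is not how janitor implements `clean_names()`.

```r
# Hypothetical helper: convert CamelCase column names to snake_case
to_snake_case <- function(x) {
  # Insert "_" between a lower-case letter or digit and the capital that follows
  with_breaks <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
  tolower(with_breaks)
}

original <- c("Id", "ActivityDate", "TotalSteps", "TotalTimeInBed")

lowered <- tolower(original)        # style applied in this analysis
snaked  <- to_snake_case(original)  # style previewed by clean_names()

lowered
snaked
```

Assigning the result, as in `daily_activity <- clean_names(daily_activity)`, would keep the snake_case names, but the later code in this analysis expects the all-lower-case names.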
The date formats in daily_sleep are not consistent, so I convert them all to a single format. For the hourly_calories, hourly_intensities, and hourly_steps datasets, I convert the date strings to date-time values (POSIXct). For daily_activity and daily_sleep, I convert them to dates with as.Date().
##daily_sleep
daily_sleep <- daily_sleep %>%
mutate(
sleepday = case_when(
grepl("\\d+/\\d+/\\d+ \\d+:\\d+:\\d+ [APap][Mm]", sleepday) ~
as.character(as.POSIXct(sleepday, format = "%m/%d/%Y %I:%M:%S %p", tz = Sys.timezone())),
grepl("\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}", sleepday) ~
as.character(as.POSIXct(sleepday, format = "%m-%d-%Y %H:%M", tz = Sys.timezone())),
TRUE ~ NA_character_ # Mark rows with unrecognized formats as NA
)
)
daily_sleep$sleepday <- as.Date(daily_sleep$sleepday, format = "%Y-%m-%d")
daily_sleep <- daily_sleep %>%
rename(date = sleepday)
class(daily_sleep$date)
## [1] "Date"
##Daily_activity
daily_activity$activitydate <- as.Date(daily_activity$activitydate, format = "%d-%m-%Y")
daily_activity <- daily_activity %>%
rename(date = activitydate)
##hourly_calories
hourly_calories <- hourly_calories %>%
rename(date_time = activityhour) %>%
mutate(date_time = as.POSIXct(date_time, format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
class(hourly_calories$date_time)
## [1] "POSIXct" "POSIXt"
##hourly_intensities
hourly_intensities <- hourly_intensities %>%
rename(date_time = activityhour) %>%
mutate(date_time = as.POSIXct(date_time, format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
class(hourly_intensities$date_time)
## [1] "POSIXct" "POSIXt"
##hourly_steps
hourly_steps<- hourly_steps %>%
rename(date_time = activityhour) %>%
mutate(date_time = as.POSIXct(date_time, format ="%m/%d/%Y %I:%M:%S %p" , tz=Sys.timezone()))
class(hourly_steps$date_time)
## [1] "POSIXct" "POSIXt"
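Since only the calendar date is kept for `daily_sleep`, the mixed formats could also be handled without going through `POSIXct` and `Sys.timezone()`. The following is a minimal sketch under that assumption; `parse_mixed_date` is a hypothetical helper, and the third input value is invented to show what happens with an unrecognized string.

```r
# Hypothetical helper: parse strings that use either m/d/Y or m-d-Y dates
parse_mixed_date <- function(x) {
  # Try the slash format first; text after the date part is ignored
  parsed <- as.Date(x, format = "%m/%d/%Y")
  # Fill the remaining gaps using the dash format
  needs_dash <- is.na(parsed)
  parsed[needs_dash] <- as.Date(x[needs_dash], format = "%m-%d-%Y")
  parsed  # anything matching neither format stays NA
}

sleep_days <- c("04-12-2016 00:00", "4/13/2016 12:00:00 AM", "not a date")
parsed_days <- parse_mixed_date(sleep_days)
parsed_days
```

Checking `sum(is.na(parsed_days))` after a conversion like this is a quick way to see how many rows did not match any expected format.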
activity_summary <- daily_activity %>%
group_by(id) %>%
summarise(
veryactiveminutes = sum(veryactiveminutes, na.rm = TRUE),
fairlyactiveminutes = sum(fairlyactiveminutes, na.rm = TRUE),
lightlyactiveminutes = sum(lightlyactiveminutes, na.rm = TRUE),
sedentaryminutes = sum(sedentaryminutes, na.rm =TRUE)
)
activity_summary <- activity_summary %>%
select(veryactiveminutes, fairlyactiveminutes, lightlyactiveminutes, sedentaryminutes) %>%
pivot_longer(cols = everything(), names_to = "ActivityType", values_to = "Minutes") %>%
mutate(Percentage = (Minutes / sum(Minutes)) * 100)
head(activity_summary)
## # A tibble: 6 × 3
## ActivityType Minutes Percentage
## <chr> <dbl> <dbl>
## 1 veryactiveminutes 1200 0.105
## 2 fairlyactiveminutes 594 0.0518
## 3 lightlyactiveminutes 6818 0.595
## 4 sedentaryminutes 26293 2.30
## 5 veryactiveminutes 269 0.0235
## 6 fairlyactiveminutes 180 0.0157
plot_ly(activity_summary, labels = ~ActivityType, values = ~Minutes,
type = 'pie',textposition = 'outside', textinfo = 'percent',
text = ~paste(ActivityType, ": ", round(Percentage, 2), "%"),
marker = list(colors = c('#AADEA7', '#64C2A6', '#F6AB49', '#F66D44')),
width = 500, # Set the width of the chart
height = 500 # Set the height of the chart
) %>%
layout(
title = 'Percentage Distribution of Active Minutes'
)
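In the summary above, minutes are totalled per user and `Percentage` divides each user-level row by the grand total, which is why the values printed by `head()` are fractions of a percent. The pie chart still shows overall shares because plotly sums `Minutes` for repeated labels. If a table with one overall share per activity type is wanted, a sketch along the following lines could be used. The numbers in `toy_activity` are invented for illustration.

```r
# Toy stand-in for daily_activity (values are invented)
toy_activity <- data.frame(
  id = c(1, 1, 2),
  veryactiveminutes = c(30, 20, 10),
  fairlyactiveminutes = c(10, 20, 10),
  lightlyactiveminutes = c(200, 180, 220),
  sedentaryminutes = c(760, 780, 760)
)

minute_cols <- c("veryactiveminutes", "fairlyactiveminutes",
                 "lightlyactiveminutes", "sedentaryminutes")

# Total each category across all users and days first
totals <- colSums(toy_activity[minute_cols], na.rm = TRUE)

# Then express each category as a share of all tracked minutes
overall_share <- data.frame(
  ActivityType = names(totals),
  Minutes = unname(totals),
  Percentage = round(100 * unname(totals) / sum(totals), 2)
)

overall_share
```

Here the four percentages add up to 100, so each row can be read directly as that category's share of tracked time.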
Key Observations from the Chart:
The chart shows the percentage of tracked minutes in each of the four categories: very active, fairly active, lightly active, and sedentary.
The major issue is the large share of time spent sedentary and the minimal time spent in higher-intensity activity (very active and fairly active).
This pattern is common in modern lifestyles but can have serious consequences for long-term health. Users would benefit from gradually increasing the share of time spent in moderate-to-vigorous activity.
calories_per_distance <- daily_activity %>%
mutate(WeekDay = weekdays(date)) %>%
mutate(WeekDay = ordered(WeekDay, levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))) %>%
group_by(WeekDay) %>%
summarise(
calories_per_day = sum(calories, na.rm = TRUE),
distance_per_day = sum(totaldistance, na.rm = TRUE),
calories_per_distance = round(sum(calories, na.rm = TRUE) / sum(totaldistance, na.rm = TRUE), 2)
)
head(calories_per_distance)
## # A tibble: 6 × 4
## WeekDay calories_per_day distance_per_day calories_per_distance
## <ord> <dbl> <dbl> <dbl>
## 1 Monday 278905 666. 419.
## 2 Tuesday 358114 886. 404.
## 3 Wednesday 345393 823. 420.
## 4 Thursday 323337 781. 414.
## 5 Friday 293805 669. 439.
## 6 Saturday 292016 726. 402.
ggplot(calories_per_distance, aes(x = WeekDay, y = calories_per_distance, group = 1)) +
geom_line(color = "darkblue", linewidth = 1.2) +
geom_point(color = "lightblue", size = 3) +
scale_y_continuous(labels = scales::comma) +
labs(
title = "Total Calories Burned Per Distance by Weekday",
x = "Weekday",
y = "Calories Per Distance"
)
Key Observations from the Chart:
sleep_data_summary <- daily_sleep %>%
mutate(Weekdays = wday(`date`, label = TRUE, week_start = 1)) %>%
group_by(Weekdays) %>%
summarize(
Avg_Minutes_Asleep = mean(totalminutesasleep),
Avg_Time_In_Bed = mean(totaltimeinbed)
) %>%
arrange(Weekdays) %>%
pivot_longer(
cols = c(Avg_Minutes_Asleep, Avg_Time_In_Bed),
names_to = "Category",
values_to = "Value"
) %>%
mutate(Value = round(Value, 2)) # Round the Value column
head(sleep_data_summary)
## # A tibble: 6 × 3
## Weekdays Category Value
## <ord> <chr> <dbl>
## 1 Mon Avg_Minutes_Asleep 420.
## 2 Mon Avg_Time_In_Bed 457.
## 3 Tue Avg_Minutes_Asleep 405.
## 4 Tue Avg_Time_In_Bed 443.
## 5 Wed Avg_Minutes_Asleep 435.
## 6 Wed Avg_Time_In_Bed 470.
ggplot(sleep_data_summary, aes(x = Weekdays, y = Value, fill = Category)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
scale_fill_manual(values = c("Avg_Minutes_Asleep" = "blue", "Avg_Time_In_Bed" = "orange")) +
scale_y_continuous(labels = scales::label_number(accuracy = 1)) + # Fully qualified function call
labs(
title = "Sleep Behavior by Day of the Week",
x = "Day of Week",
y = "Minutes",
fill = "Legend"
)
Key Observations from the Chart:
weekday_steps <- daily_activity %>%
mutate(weekday = weekdays(date))
weekday_steps$weekday <-ordered(weekday_steps$weekday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
weekday_steps <-weekday_steps %>%
group_by(weekday) %>%
summarize (daily_steps = sprintf("%.2f", mean(totalsteps)))
head(weekday_steps)
## # A tibble: 6 × 2
## weekday daily_steps
## <ord> <chr>
## 1 Monday 7780.87
## 2 Tuesday 8125.01
## 3 Wednesday 7559.37
## 4 Thursday 7405.84
## 5 Friday 7448.23
## 6 Saturday 8152.98
ggplot(data = weekday_steps, aes(x = weekday, y = as.numeric(daily_steps))) +
geom_bar(stat = "identity",fill = 'lightblue',width = 0.8) +
geom_hline(yintercept = 7500)+
labs(
title = "Weekly Average Steps Distribution",
x = "Weekday",
y = "Daily Steps"
) +
scale_y_continuous(breaks = seq(0, 10000, by = 2000)) # Adjust breaks for rounded values
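Two details of the weekday summary may be worth revisiting: `weekdays()` returns day names in the session's locale, so the English `levels` only match in an English locale, and `sprintf()` turns the averages into text that then has to be converted back with `as.numeric()`. The sketch below is one possible alternative on toy data. It derives the weekday from the ISO day number (`%u`, Monday = 1) and keeps the mean numeric with `round()`. `toy_steps` and its values are assumptions made for the example.

```r
# Toy stand-in for daily_activity (values are invented)
toy_steps <- data.frame(
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-19")),
  totalsteps = c(8000, 7000, 9000)
)

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# Map the ISO weekday number (1 = Monday) to a fixed English label
toy_steps$weekday <- factor(
  as.integer(format(toy_steps$date, "%u")),
  levels = 1:7,
  labels = day_names,
  ordered = TRUE
)

# Average steps per weekday, kept as a number rather than a string
avg_steps <- aggregate(
  totalsteps ~ weekday,
  data = toy_steps,
  FUN = function(x) round(mean(x), 2)
)

avg_steps
```

Because `totalsteps` stays numeric, it can be passed to `aes(y = ...)` directly without a conversion step.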
Key Observations from the Chart:
hourly_steps <- hourly_steps %>%
mutate(
date = format(date_time, "%Y-%m-%d"),
time = format(date_time, "%H:%M:%S")
) %>%
mutate(date = ymd(date)) %>%
select(id, date, time, steptotal) %>%
group_by(time) %>%
summarize(average_steps = mean(steptotal)) # Calculate average steps for each time
head(hourly_steps)
## # A tibble: 6 × 2
## time average_steps
## <chr> <dbl>
## 1 00:00:00 42.2
## 2 01:00:00 23.1
## 3 02:00:00 17.1
## 4 03:00:00 6.43
## 5 04:00:00 12.7
## 6 05:00:00 43.9
ggplot(hourly_steps, aes(x = time, y = average_steps, fill = average_steps)) + # Map fill to average_steps
geom_bar(stat = "identity") +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(
title = "Hourly Step Activity",
x = "Time of Day",
y = "Average Steps"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x-axis labels for readability
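The chunk above overwrites `hourly_steps` with its hourly summary, so the original hourly rows are no longer available afterwards. A hedged alternative is to store the summary under a new name and to parse with a fixed time zone so results do not depend on `Sys.timezone()`. The sketch uses a few invented rows; `toy_hourly` and `steps_by_hour` are hypothetical names, and parsing `%p` assumes an English or C locale.

```r
# Toy stand-in for the raw hourly steps table (values are invented)
toy_hourly <- data.frame(
  activityhour = c("4/12/2016 12:00:00 AM",
                   "4/12/2016 1:00:00 PM",
                   "4/13/2016 1:00:00 PM"),
  steptotal = c(100, 400, 600)
)

# Parse with an explicit time zone instead of the machine's local one
toy_hourly$date_time <- as.POSIXct(
  toy_hourly$activityhour,
  format = "%m/%d/%Y %I:%M:%S %p",
  tz = "UTC"
)

# Extract the hour of day as a two-digit label
toy_hourly$hour <- format(toy_hourly$date_time, "%H")

# Keep the summary in its own object so toy_hourly stays intact
steps_by_hour <- aggregate(steptotal ~ hour, data = toy_hourly, FUN = mean)
steps_by_hour
```

Keeping the raw table intact makes it possible to re-run or extend the hourly analysis later without reloading the CSV.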
Key Observations from the Chart:
active_minutes_calories <- daily_activity %>%
group_by(id, date) %>%
summarise(
Total_Very_Active_Minutes = sum(veryactiveminutes, na.rm = TRUE), # Sum of Very Active Minutes
Total_Calories = sum(calories, na.rm = TRUE) # Sum of Calories
) %>%
ungroup() %>%
mutate(
# Calories per very active minute; NA when there were no very active minutes (avoids dividing by zero)
Calories_Per_VeryActiveMinute = if_else(Total_Very_Active_Minutes == 0, NA_real_, Total_Calories / Total_Very_Active_Minutes)
)
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
head(active_minutes_calories)
## # A tibble: 6 × 5
## id date Total_Very_Active_Mi…¹ Total_Calories Calories_Per_VeryAct…²
## <dbl> <date> <dbl> <dbl> <dbl>
## 1 1.50e9 2016-04-12 25 1985 79.4
## 2 1.50e9 2016-04-13 21 1797 85.6
## 3 1.50e9 2016-04-14 30 1776 59.2
## 4 1.50e9 2016-04-15 29 1745 60.2
## 5 1.50e9 2016-04-16 36 1863 51.8
## 6 1.50e9 2016-04-17 38 1728 45.5
## # ℹ abbreviated names: ¹Total_Very_Active_Minutes,
## # ²Calories_Per_VeryActiveMinute
active_minutes_calories %>%
ggplot(aes(x = Total_Very_Active_Minutes, y = Total_Calories, color = Calories_Per_VeryActiveMinute)) +
geom_point() +
scale_color_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Correlation Between Active Minutes and Calorie Burn",
x = "Total Very Active Minutes",
y = "Total Calories Burned") +
theme_minimal()
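The scatter plot shows the relationship visually, but the strength of the association is not quantified in the analysis. As a sketch, a Pearson correlation coefficient could be computed with base R's `cor()`. The values in `toy_daily` are made up, so the resulting coefficient says nothing about the FitBit data.

```r
# Toy stand-in for active_minutes_calories (values are invented)
toy_daily <- data.frame(
  very_active_minutes = c(0, 10, 25, 40, 60),
  calories = c(1700, 1850, 2000, 2300, 2600)
)

# Pearson correlation between very active minutes and calories burned;
# use = "complete.obs" drops rows where either value is missing
r <- cor(
  toy_daily$very_active_minutes,
  toy_daily$calories,
  use = "complete.obs",
  method = "pearson"
)

r
```

`cor.test()` would additionally report a confidence interval and a p-value for the same pair of variables.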
Key Observations from the Chart:
The analysis highlights key patterns in activity and sleep behaviour. Sedentary time dominates (81.3% of tracked minutes) and high-intensity activity is scarce, a combination that increases health risks and underlines the need to reduce inactivity. Sleep duration averages 6.7–7.5 hours, at or slightly below the lower end of the recommended 7–9 hours, suggesting a focus on better sleep hygiene to improve rest quality. Average step counts are close to or above the 7,500-step baseline on most days, with peak activity in the evening (5–7 PM). Sundays bring improved sleep but reduced activity, underscoring the importance of maintaining consistent physical activity throughout the week.
Personalized Activity and Lifestyle Goals: Set activity and step targets based on users’ past performance, lifestyle, or preferences, customizable during sign-up. Offer calorie-tracking features or integrate with third-party apps to align physical activity data with dietary goals. Provide meal suggestions or personalized diet plans and include hydration reminders to encourage a balanced lifestyle.
Sedentary and Heart Rate Alerts: Notify users to engage in light activities after prolonged inactivity with prompts like, “You’ve been sedentary for 2 hours; take a 5-minute walk!” Additionally, monitor heart rate and send alerts for abnormal fluctuations, ensuring timely action if thresholds are exceeded.
Sleep and Recovery Insights: Provide tailored advice for improving sleep routines, such as setting a regular bedtime, reducing screen time, or relaxing before sleep. Include tips for creating a restful environment and maintaining consistent recovery practices.
Weekly Dashboard: Summarize weekly activity, calories burned, and sleep performance with improvement tips in a user-friendly dashboard. Share a detailed summary every Sunday to help users track progress and adopt healthier habits.
Motivational Features and Social Engagement: Incorporate gamified elements like streaks, badges, and friendly competitions to keep users motivated. Include period tracking features to help users monitor their menstrual cycles, including reminders for expected periods, ovulation, and tips for managing symptoms.
Implementing these recommendations can significantly enhance the Bellabeat app’s ability to support users in achieving a healthier lifestyle. Together, these features can create a holistic experience that fosters long-term habits. They not only improve user satisfaction but also position Bellabeat as a comprehensive wellness companion.